Overview

The goal of this study is to investigate the integration of semantic and pragmatic information during word learning in children between 2 and 5 years of age. As a first step, we replicate earlier work showing that children rely on different forms of pragmatic and semantic information during word learning. In experiment 1, we show that children make a so-called mutual exclusivity inference and that this inference depends on children’s developing semantic knowledge. In a second experiment, we show that children make inferences about word meanings based on common ground. In a third experiment, we find that, when the two inferences are combined in one procedure, children are sensitive to the way they are aligned.

Next, we introduce a computational framework which we use to formalize how the inferences are integrated. As part of this, we identify three information sources that children consider when making these inferences: semantic knowledge, expectations about speaker informativeness and sensitivity to common ground. We then use our modelling framework to ask which of these information sources are necessary to predict children’s responses in experiment 3.

In the final section, we turn to the process by which information is integrated. We contrast the process of rational Bayesian inference we introduced in our model with a biased integration process in which one type of inference is given more weight. As part of this, we also explore alternative ways to think about developmental change in the integration process.

Empirical studies

Experiment 1: Mutual exclusivity

The first experiment tested the so-called mutual exclusivity inference in children between 2 and 5 years of age. The general phenomenon is that, when presented with a familiar and an unfamiliar object, children expect a novel word to refer to the unfamiliar object (e.g. Markman and Wachtel 1988). A range of explanations have been put forward for the cognitive basis of this inference (see Lewis et al. 2020 for a discussion). Here, we treat the mutual exclusivity inference as pragmatic (e.g. Clark 1987). The inference process is specified in the model below.

The first goal of this experiment was to quantify developmental change in the age range tested. The second goal was to test the role of semantic knowledge (cf. Lewis et al. 2020). The assumption is that the strength of the mutual exclusivity inference varies with knowledge of the word for the familiar object. That is, when the familiar object is an object for which children are less likely to know the word, they are less likely to assume that the novel word refers to the unfamiliar object. To test this, we systematically varied the familiar objects that were presented alongside the novel object.

The experiment was preregistered at https://osf.io/gy37b. The experiment itself can be run by downloading the associated repository and opening the file experiments/kids/kids_me.html.

Participants

We tested a total number of 90 children, including 30 2-year-olds (range = 2.03 - 3.00, 15 girls), 30 3-year-olds (range = 3.03 - 3.97, 22 girls) and 30 4-year-olds (range = 4.03 - 4.90, 16 girls). Data from 10 additional children was not included because they were either exposed to less than 75% of English at home (5), did not finish at least half of the test trials (2), the technical equipment failed (2) or their parents reported an autism spectrum disorder (1). All children were recruited from the floor of a Children’s museum in San José, California, USA. This population is characterized by diverse ethnic background (predominantly White, Asian, or mixed ethnicity) and high levels of parental education and socioeconomic status. Parents consented to their children’s participation and provided demographic information. All experiments were approved by the Stanford Institutional Review Board (protocol no. 19960).

Procedure

The experiment was presented as an interactive picture book on a tablet computer (Frank et al. 2016). Figure 1A shows the general setup. Children saw an animal standing on a little hill between two tables. For each animal character, we recorded a set of utterances (one native English speaker per animal) that were used to make requests. Each experiment started with two training trials in which the speaker requested known objects (car and ball).

In experiment 1, there was a familiar object on one table and a novel object (drawn for the purpose of the study) on the other. The speaker requested an object by saying “Oh cool, there is a [non-word] on the table, how neat, can you give me the [non-word]?”. Children responded by touching one of the objects. The location of the novel object (left or right table) and the animal character were counterbalanced. Each child received 12 trials, one with each familiar object. The novel object also changed from trial to trial. We coded a response as correct if children chose the novel object as the referent of the novel word.

Figure 1: Schematic experimental procedure with screenshots from the experiments.

Each child completed 12 trials, each with a different familiar and a different novel object. Familiar objects were selected to vary in how likely children were to know the word for each object. This included objects that most 2-year-olds could name (e.g. a duck) as well as objects that only very few 5-year-olds could name (e.g. a pawn). The selection was based on age of acquisition ratings from Kuperman and colleagues (2012). While these ratings do not capture the absolute age when children acquire these words, they capture the relative order in which words are learned. Figure 2A shows the objects used in the experiment. We induced this variation to estimate the role of semantic knowledge in a mutual exclusivity inference.

Results

chance_me <- me_data %>%
  group_by(subage, subid) %>%
  summarise(correct = mean(correct)) %>%
  summarise(correct = list(correct)) %>%
  group_by(subage)%>%
  mutate(Mean= round(mean(unlist(correct)),2),
         BayesFactor = format(round(extractBF(ttestBF(unlist(correct), mu = 0.5))$bf), scientific = F),
         `Age group` = subage)%>%
  ungroup()%>%
  select(`Age group`, Mean, BayesFactor)

knitr::kable(chance_me, caption = "Proportion of children choosing the novel object compared to a level expected by chance based on a one sample Bayesian t-test. Responses are aggregated for each participant across familiar objects.", digits = 2, align = "l")
Table 1: Proportion of children choosing the novel object compared to a level expected by chance based on a one sample Bayesian t-test. Responses are aggregated for each participant across familiar objects.
Age group   Mean   BayesFactor
2           0.61   132
3           0.73   185881356
4           0.86   72514087738

As a first step, we evaluated whether children made a mutual exclusivity inference. For this analysis, we aggregated participants’ responses across familiar objects. We used the function ttestBF from the R-package BayesFactor (Morey and Rouder 2018) to compute a Bayes factor (BF) in favor of the hypothesis that children chose the novel object more often than expected by chance (50% correct). Table 1 shows that all age groups made the inference.

# prior_me <- c(prior(normal(0, 5), class = Intercept),
#            prior(normal(0, 5), class = b),
#            prior(cauchy(0, 1), class = sd))
# 
# 
# bm_me <- brm(correct ~ age + (1|subid) + (age | item) + (age | agent),
#                     data = me_data, family = bernoulli(),
#           control = list(adapt_delta = 0.99, max_treedepth = 20),
#           sample_prior = F,
#           prior = prior_me,
#           cores = 4,
#           chains = 4,
#           iter = 5000)%>%
#   saveRDS(.,"../saves/bm_me.rds")
# 
# 
# bm_me2 <- brm(correct ~ age + (1|subid) + (age | agent),
#                     data = me_data, family = bernoulli(),
#           control = list(adapt_delta = 0.99, max_treedepth = 20),
#           sample_prior = F,
#           prior = prior_me,
#           cores = 4,
#           chains = 4,
#           iter = 5000)%>%
#   saveRDS(.,"../saves/bm_me2.rds")

bm_me <- readRDS("../saves/bm_me.rds")

bm_me2 <- readRDS("../saves/bm_me2.rds")

As a second step, we investigated how the inference changed as a function of age and the familiar object. We modeled the trial-by-trial data using a Bayesian generalized linear mixed model (GLMM). We used the function brm from the package brms (Bürkner 2017). We pre-registered the use of default priors in all models. However, the model in Experiment 3 was unable to initialize with default priors, so we used weakly informative priors for all models to be consistent. The priors we used were normal(0,5) for fixed effects and cauchy(0,1) for standard deviations of random effects. The model formula was correct ~ age + (1 | id) + (age | object) + (age | agent). That is, we modeled an overall slope for age (continuous, anchored at the minimum) and the object-specific developmental trajectories as deviations from the overall intercept and slope (random effects). We did not pre-register agent as a random effect, but retrospectively included it to be consistent with Experiments 2 and 3.

The estimate for age was positive and reliably different from zero (\(\beta\) = 0.91, 95% CrI: 0.58 - 1.3). Older children were more likely to make a mutual exclusivity inference. To assess the variability across objects, we compared the fit of the above model to a model lacking object as a random effect. Following McElreath (2016), we compared models using WAIC (widely applicable information criterion) scores and weights. The WAIC score is an indicator of the model’s predictive accuracy for out-of-sample data; models with lower scores are preferred. WAIC weights are an estimate of the probability that a given model (compared to all other models considered) will make the best predictions on new data. The model including object provided a much better fit than the model lacking it (see Table 2). Figure 2B visualizes the model-based developmental trajectory for each familiar object and illustrates the substantial variation between them, both in the absolute strength of the inference and in its developmental trajectory. Figure 2C shows the correlation between rated age of acquisition and the object-specific model intercept. The mutual exclusivity effect was stronger for words that were rated to be acquired earlier. Objects for which children were less likely to know the word produced a weaker mutual exclusivity effect. Taken together, the strength of the mutual exclusivity inference depended on age as well as on the familiar object.
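To make the relation between WAIC scores and weights concrete, the following sketch recomputes weights from the two WAIC estimates reported in Table 2. It uses the standard Akaike-type formula, in which each model's weight is proportional to exp(-0.5 × its WAIC difference to the best model). We assume, but have not verified here, that this corresponds to what model_weights computes internally; the variable names are ours.

```r
# WAIC estimates as reported in Table 2.
waic <- c(with_object = 1089.05, without_object = 1202.46)

# Difference of each model to the best (lowest) WAIC.
delta <- waic - min(waic)

# Akaike-type weights: relative likelihoods, normalised to sum to one.
rel_lik <- exp(-0.5 * delta)
weights <- rel_lik / sum(rel_lik)

print(round(weights, 2))
```

With a WAIC difference of more than 100 points, essentially all of the weight falls on the model that includes object as a random effect, which is why the rounded weights in Table 2 are 1 and 0.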

me_waic <- brms::waic(bm_me, bm_me2, compare = F)

me_weights <- model_weights(bm_me, bm_me2, weights = "waic")

me_comp <- tibble(
  Model = c("with object as RE", "without object as RE"),
  WAIC = round(c(me_waic$loos$bm_me$estimates["waic","Estimate"],me_waic$loos$bm_me2$estimates["waic","Estimate"]),2),
  SE = round(c(me_waic$loos$bm_me$estimates["waic","SE"],me_waic$loos$bm_me2$estimates["waic","SE"]),2),
  weight = round(c(me_weights[1],me_weights[2]))
)

knitr::kable(me_comp, caption = "Model comparison in experiment 1 based on WAIC scores and weights.", digits = 2, align = "l")
Table 2: Model comparison in experiment 1 based on WAIC scores and weights.
Model                  WAIC      SE      weight
with object as RE      1089.05   32.19   1
without object as RE   1202.46   31.28   0

Figure 2: A: Familiar words and corresponding pictures by rated age of acquisition. B: Developmental trajectories of the mutual exclusivity effect by familiar object based on the mean of the model posterior distribution. Dots show individual data points. Lighter colors indicate later rated age of acquisition. Dotted line indicates a level of performance expected by chance. C: Correlation between rated age of acquisition and mutual exclusivity effect (model-based intercept for each familiar object).

Experiment 2: Common ground

Here we tested children’s sensitivity to common ground that is built up over the course of a conversation. In particular, we tested whether children keep track of which object is new to a speaker and which object the speaker has encountered previously (Akhtar, Carpenter, and Tomasello 1996; Diesendruck et al. 2004). The main goal of the experiment was to measure how children’s sensitivity to common ground changes with age.

The experiment was preregistered at https://osf.io/au5hr. The experiment itself can be run by downloading the associated repository and opening the file experiments/kids/kids_novel.html.

Participants

We tested 58 children from the same general population as in Experiment 1, including 18 2-year-olds (range = 2.02 - 2.93, 7 girls), 19 3-year-olds (range = 3.01 - 3.90, 14 girls) and 21 4-year-olds (range = 4.07 - 4.93, 14 girls). Data from 5 additional children was not included because they were either exposed to less than 75% of English at home (3) or the technical equipment failed (2).

Procedure

The general setup was the same as in Experiment 1. The speaker was positioned between the tables. There was a novel object (drawn for the purpose of the study) on one of the tables while the other table was empty. Next, the speaker turned to one of the tables and either commented on the presence (“Aha, look at that.”) or the absence (“Hm, nothing there”) of an object. Then the speaker disappeared. While the speaker was away, a second novel object appeared on the previously empty table. Then the speaker returned and requested an object in the same way as in Experiment 1 (see also Figure 1B). The positioning of the novel object at the beginning of the experiment and the location the speaker turned to first were counterbalanced. Children received five trials, each with a different pair of novel objects. We coded a response as correct if children chose the object that was new to the speaker as the referent of the novel word.

Results

chance_prior <- prior_data %>%
  group_by(subage, subid) %>%
  summarise(correct = mean(correct)) %>%
  summarise(correct = list(correct)) %>%
  group_by(subage)%>%
  mutate(Mean= round(mean(unlist(correct)),2),
         BayesFactor = format(round(extractBF(ttestBF(unlist(correct), mu = 0.5))$bf,2), scientific = F),
         `Age group` = subage)%>%
  ungroup()%>%
  select(`Age group`, Mean, BayesFactor)

knitr::kable(chance_prior, caption = "Proportion of children choosing the object that was new to the speaker compared to a level expected by chance based on a one sample Bayesian t-test. Responses are aggregated for each participant across trials.", digits = 2, align = "l")
Table 3: Proportion of children choosing the object that was new to the speaker compared to a level expected by chance based on a one sample Bayesian t-test. Responses are aggregated for each participant across trials.
Age group   Mean   BayesFactor
2           0.55   0.4
3           0.76   26.55
4           0.83   6956.06

Table 3 compares children’s correct responses to a level expected by chance (50%). We found evidence that, as a group, 3- and 4-year-olds, but not 2-year-olds, inferred that the novel word referred to the object that was new to the speaker.

# prior_cg <- c(prior(normal(0, 5), class = Intercept),
#            prior(normal(0, 5), class = b),
#            prior(cauchy(0, 1), class = sd))
# 
# 
# bm_cg <- brm(correct ~ age + (1|subid) + (age | agent),
#                     data = prior_data, family = bernoulli(),
#           control = list(adapt_delta = 0.99, max_treedepth = 20),
#           sample_prior = F,
#           prior = prior_cg,
#           cores = 4,
#           chains = 4,
#           iter = 5000)%>%
#   saveRDS(.,"../saves/bm_cg.rds")

bm_cg <- readRDS("../saves/bm_cg.rds")

To directly investigate whether children’s responses changed with age, we modeled the trial-by-trial data using a Bayesian GLMM (formula: correct ~ age + (1 | id) + (age | speaker); for specifications, see Experiment 1). The estimate for age was positive and reliably different from zero (\(\beta\) = 0.92, 95% CrI: 0.37 - 1.54, see Figure 3A). Older children were more likely to choose the object that was new to the speaker as the referent of the novel word, suggesting that sensitivity to common ground in this context increases with age.

Experiment 3: Integration

Experiment 3 combined the procedures from Experiments 1 and 2. As a consequence, children had to consider not just their semantic knowledge of the word for the familiar object and the inference this licenses, but also the role that each object (novel and familiar) had played in the preceding interaction. Combining the two procedures created two conditions. In the congruent condition, the novel object was also the object that was new to the speaker. In this case, the mutual exclusivity inference as well as the common ground inference pointed to the novel object as the referent. In the incongruent condition, the familiar object was new to the speaker. In this case, the two inferences pointed to different objects. The main focus of the overall study was to model how children integrate and balance these different information sources. We investigate this question in depth in the modelling section below. Here, we limit the discussion to whether children differentiated between the two conditions.

The experiment was preregistered at https://osf.io/4nm8g. The experiment itself can be run by downloading the associated repository and opening the file experiments/kids/kids_combination.html.

Participants

We tested 220 children from the same general population as in Experiments 1 and 2, including 76 2-year-olds (range = 2.04 - 2.99, 7 girls), 72 3-year-olds (range = 3.00 - 3.98, 14 girls) and 72 4-year-olds (range = 4.00 - 4.94, 14 girls). Data from 20 additional children was not included because they were either exposed to less than 75% of English at home (15), did not finish at least half of the test trials (3) or the technical equipment failed (2).

Procedure

Experiment 3 followed the same procedure as Experiment 2 but involved the same objects as Experiment 1. In the beginning, one table was empty while there was an object (novel or familiar) on the other one. After commenting on the presence or absence of an object on each table, the speaker disappeared and a second object appeared (familiar or novel). Next, the speaker re-appeared and made the usual request.

In the congruent condition, the familiar object was present in the beginning and the novel object appeared while the speaker was away (Figure 1C - left). In this case, both the mutual exclusivity and the common ground inference pointed to the novel object as the referent. In the incongruent condition, the novel object was present in the beginning and the familiar object appeared later. In this case, the two inferences pointed to different objects (Figure 1C - right).

Participants received up to 12 test trials, six in each condition, each with a different familiar and novel object. Familiar objects were the same as in Experiment 1. The positioning of the objects on the tables and the location the speaker first turned to were counterbalanced. Participants could stop the experiment after six trials (three per condition). If a participant stopped after half of the trials, we tested an additional participant to reach a pre-registered number of data points per cell.

Results

All results are reported from the perspective of the mutual exclusivity inference (correct in the model formula below). In the incongruent condition, high proportions indicate a mutual exclusivity inference and low proportions indicate a common ground inference. In the congruent condition, both inferences pointed in the same direction. The focus of this experiment was on information integration and we therefore did not compare the performance to chance.

# prior_comb <- c(prior(normal(0, 5), class = Intercept),
#            prior(normal(0, 5), class = b),
#            prior(cauchy(0, 1), class = sd))
# 
# bm_comb <- brm(correct ~ age * alignment + (alignment | subid) + (age * alignment | item)+ (age * alignment | agent),
#                     data = comb_data, family = bernoulli(),
#           control = list(adapt_delta = 0.99, max_treedepth = 20),
#           sample_prior = F,
#           prior = prior_comb,
#           cores = 4,
#           chains = 4,
#           inits = 0,
#           iter = 5000)%>%
#   saveRDS(.,"../saves/bm_comb.rds")
# 
# bm_comb2 <- brm(correct ~ age * alignment + (alignment | subid) + (age * alignment | agent),
#                     data = comb_data, family = bernoulli(),
#           control = list(adapt_delta = 0.99, max_treedepth = 20),
#           sample_prior = F,
#           prior = prior_comb,
#           cores = 4,
#           chains = 4,
#           inits = 0,
#           iter = 5000)%>%
#   saveRDS(.,"../saves/bm_comb2.rds")

bm_comb <- readRDS("../saves/bm_comb.rds")

bm_comb2 <- readRDS("../saves/bm_comb2.rds")

We modeled the trial-by-trial data in the following way: correct ~ age * alignment + (alignment | subid) + (age * alignment | item) + (age * alignment | agent). We had pre-registered including item as a fixed effect in Experiment 3, but the corresponding model was too complex to be constrained by the data. Furthermore, as explained in Experiment 1, items were chosen based on their rated age of acquisition. That is, we assumed that they are not necessarily different kinds but that they represent different locations on a distribution of required semantic knowledge. We therefore included item as a random effect. For further model specifications, see Experiment 1.

The estimate for age was reliably positive (\(\beta\) = 0.81, 95% CrI: 0.4 - 1.24). The incongruent condition had a strong negative impact (\(\beta\) = -1.35, 95% CrI: -2.17 to -0.55), showing that children differentiated between the two conditions. The estimate for the interaction term was negative, but its credible interval included zero (\(\beta\) = -0.2, 95% CrI: -0.66 to 0.27), providing at most weak evidence for a shallower slope for age in the incongruent condition. A model lacking object as a random effect provided a much poorer fit, suggesting substantial variation across objects (see Table 4). Figure 3B visualizes the model. Taken together, the results show that children responded to the way the two inferences were aligned with one another.
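Because the coefficients above are on the logit scale, it can help to see how they translate into choice probabilities. The sketch below plugs the reported posterior means into the linear predictor. Two things are assumptions on our part and not taken from the fitted model: the intercept (not reported in this section, so a placeholder value is used) and the coding of alignment as a 0/1 dummy with the congruent condition as reference. Age is entered as years above the minimum age, following the anchoring described for Experiment 1.

```r
# Posterior means of the fixed effects as reported in the text (logit scale).
b_age         <- 0.81
b_incongruent <- -1.35
b_interaction <- -0.20

# ASSUMPTION: the intercept is not reported here; 0.5 is a placeholder value.
b_intercept <- 0.5

# Probability of choosing the novel object.
# years_above_min: age minus the minimum age in the sample.
# incongruent: 0 = congruent condition, 1 = incongruent condition (assumed coding).
predict_novel_choice <- function(years_above_min, incongruent) {
  eta <- b_intercept +
    b_age * years_above_min +
    b_incongruent * incongruent +
    b_interaction * years_above_min * incongruent
  plogis(eta)  # inverse logit
}

years <- c(0, 1, 2)
predictions <- rbind(
  congruent   = predict_novel_choice(years, incongruent = 0),
  incongruent = predict_novel_choice(years, incongruent = 1)
)
colnames(predictions) <- paste0("min_age_plus_", years)
print(round(predictions, 2))
```

The absolute values depend entirely on the placeholder intercept. The qualitative pattern (higher probabilities with age, lower probabilities in the incongruent condition) follows from the signs of the reported coefficients.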

comb_waic <- brms::waic(bm_comb, bm_comb2, compare = F)

comb_weights <- model_weights(bm_comb, bm_comb2, weights = "waic")

comb_comp <- tibble(
  Model = c("with object as RE", "without object as RE"),
  WAIC = round(c(comb_waic$loos$bm_comb$estimates["waic","Estimate"],comb_waic$loos$bm_comb2$estimates["waic","Estimate"]),2),
  SE = round(c(comb_waic$loos$bm_comb$estimates["waic","SE"],comb_waic$loos$bm_comb2$estimates["waic","SE"]),2),
  weight = round(c(comb_weights[1],comb_weights[2]))
)

knitr::kable(comb_comp, caption = "Model comparison in experiment 3 based on WAIC scores and weights.", digits = 2, align = "l")
Table 4: Model comparison in experiment 3 based on WAIC scores and weights.
Model                  WAIC      SE      weight
with object as RE      2188.59   46.97   1
without object as RE   2390.01   44.28   0

Figure 3: Proportion of choosing the object that was new to the speaker by age. Dots show the mean response for each participant. The solid black line shows the developmental trajectory based on the mean of the model posterior distribution. Lighter lines show 200 random draws from the posterior distribution to depict uncertainty. Dotted line indicates a level of performance expected by chance.

Discussion

The experiments reported above show that children are sensitive to the types of information sources we intended to manipulate. Experiment 1 showed that children of all age groups make a mutual exclusivity inference, that the strength of this inference increases with age and, crucially, that it depends on the type of familiar object that is presented. Experiment 2 showed that three- and four-year-olds are sensitive to the common ground manipulation we implemented and that this sensitivity increases with age. Finally, experiment 3 showed that children respond differently depending on how the mutual exclusivity inference and the common ground inference are aligned with one another. In the next section, we use Bayesian cognitive models to address the question of how information sources are integrated when the two inferences are combined with one another.

Models

The main purpose of the study was to investigate how children integrate different information sources during word learning and how this process develops with age. To do so, we use Bayesian cognitive models of pragmatic reasoning. We first describe an integration model which we think best represents the inference and integration processes and then specify how this model captures developmental change. Next, we ask how well this model predicts how children integrate the two information sources. That is, in a situation in which we know the developmental trajectories for the mutual exclusivity inference as well as for the common ground inference, what can we say about what happens when they are combined? We then test the predictive power of the model by comparing the model predictions to the data from experiment 3. We will use model comparisons to test the integration model against a range of alternative models.

Finally, we ask how well our model explains the way that children integrate the two information sources. For this analysis, we fit the free parameters in the model to all the available data, those from experiment 1 and 2 as well as the integration data from experiment 3. We then compare the model to a range of alternative models that make different assumptions about how information is integrated and how this process develops. This approach answers the question of how we can best explain how children are integrating the different information sources.

Modeling framework

The cognitive models are situated in the Rational Speech Act (RSA) framework (Frank and Goodman 2012; Goodman and Frank 2016). RSA models are models of pragmatic reasoning in that they treat language understanding as a special case of Bayesian social reasoning. A listener interprets an utterance by assuming it was produced by a cooperative speaker who had the goal to be informative. Being informative is defined as providing a message that would increase the probability of the listener inferring the speaker’s intended message. This notion of contextual informativeness captures the Gricean idea of cooperation between speaker and listener.

The model captures the following process. A listener is reasoning about the referent of a speaker’s utterance while at the same time trying to learn a lexicon (object–word mappings). This reasoning is contextualized by the prior probability of each referent. This prior probability is thought to be a function of the common ground shared between speaker and listener in that interacting around the objects changes the probability that they will be referred to later. We assume that the degree to which interactions around objects change their prior probability depends on the child’s age.

To decide between referents, the listener reasons about what a rational speaker would say given an intended referent. This speaker is assumed to compute the informativity of each available utterance and then choose the most informative one. However, this expectation of speaker informativeness may vary and is captured by the term alpha. In particular, we take alpha to be a function of the child’s age.

The informativity of each utterance is given by imagining which referent a literal listener, who interprets words according to their literal semantics, would infer upon hearing it. Thus, this reasoning depends on what kind of word–object mappings the speaker thinks the literal listener knows. We assume that knowing the literal semantics of a word is not deterministic but probabilistic. That is, for each object involved there is a probability p that the literal listener knows the word for it. For each of the novel objects, this semantic knowledge is 0. For familiar objects, it depends on the kind of object as well as on the child’s age.
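The reasoning chain described above (literal listener, informative speaker, pragmatic listener with a prior) can be illustrated with a deliberately simplified sketch in R. This is not the WebPPL model used in the study. It assumes a world with one familiar and one novel object, a speaker who can choose between the familiar object's label and the novel word, and a literal listener who treats any word they do not know as uninformative. The function name and all parameter values are hypothetical.

```r
# Simplified RSA-style sketch (illustrative assumptions; not the authors' WebPPL model).
# p_know:      probability that the literal listener knows the familiar object's word
# alpha:       expected speaker informativeness
# prior_novel: prior probability that the novel object is the referent
rsa_choose_novel <- function(p_know, alpha, prior_novel) {
  objects <- c("familiar", "novel")
  prior <- c(familiar = 1 - prior_novel, novel = prior_novel)

  # Pragmatic listener for one fixed state of lexical knowledge.
  listener <- function(knows_familiar_word) {
    # Literal listener L0(object | utterance): rows = utterances, columns = objects.
    # A known familiar word picks out the familiar object; unknown words are uninformative.
    l0 <- rbind(
      familiar_word = if (knows_familiar_word) c(1, 0) else c(0.5, 0.5),
      novel_word    = c(0.5, 0.5)
    )
    colnames(l0) <- objects

    # Speaker S1(utterance | object) is proportional to L0(object | utterance)^alpha.
    s1 <- l0^alpha
    s1 <- sweep(s1, 2, colSums(s1), "/")

    # Listener L1(object | novel word) is proportional to S1(novel word | object) * prior.
    posterior <- s1["novel_word", ] * prior
    posterior / sum(posterior)
  }

  # Average over whether the word for the familiar object is known.
  p_novel <- p_know * listener(TRUE)[["novel"]] +
    (1 - p_know) * listener(FALSE)[["novel"]]
  p_novel
}

# Equal prior: the inference strengthens with semantic knowledge.
print(rsa_choose_novel(p_know = 1, alpha = 1, prior_novel = 0.5))
print(rsa_choose_novel(p_know = 0, alpha = 1, prior_novel = 0.5))
```

In this sketch, the prior stands in for common ground: a prior that favors the novel object corresponds to the congruent condition, and a prior that favors the familiar object corresponds to the incongruent condition. Without any semantic knowledge, the listener simply falls back on the prior.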

Loci of development

The model description above points to three potential loci of developmental change: semantic knowledge, expectations about speaker informativeness and common ground sensitivity. Each of these components is represented by a parameter in the model. We capture developmental change by making these parameters a function of age and therefore estimating a developmental trajectory (intercept and slope) for each parameter.

Semantic knowledge

Semantic knowledge captures the degree of certainty with which the naive listener is assumed to know the label for the familiar object. When faced with the task, we think that children take their own semantic knowledge as the basis. As a consequence, semantic knowledge differs between familiar objects. For objects whose labels are generally acquired earlier (e.g. carrot), semantic knowledge is high, whereas for others (e.g. pawn) semantic knowledge is lower. However, semantic knowledge also varies with age in that older children are more likely to know the labels for more of the familiar objects compared to younger children. As a consequence, each familiar object has a unique developmental trajectory with respect to semantic knowledge. Technically, the object-specific parameters are estimated in the form of a random effects model, that is, each object’s trajectory is estimated as a deviation from an overall trajectory for semantic knowledge. This overall trajectory represents an increase in semantic knowledge that is independent of a particular familiar object.
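The idea of object-specific trajectories as deviations from an overall trajectory can be sketched as follows. The logistic link, the anchoring of age at 2 years and all numeric values are assumptions made for illustration; they are not estimates from the study, and the function name is ours.

```r
# Semantic knowledge as a function of age for one familiar object.
# overall:    overall intercept and slope (hypothetical values)
# object_dev: object-specific deviations from the overall intercept and slope
semantic_knowledge <- function(age, object_dev,
                               overall = c(intercept = 0, slope = 1),
                               min_age = 2) {
  intercept <- overall[["intercept"]] + object_dev[["intercept"]]
  slope     <- overall[["slope"]] + object_dev[["slope"]]
  # ASSUMPTION: a logistic link keeps the result between 0 and 1.
  plogis(intercept + slope * (age - min_age))
}

# Hypothetical deviations for an early-acquired and a late-acquired word.
carrot <- c(intercept = 2, slope = 0.5)
pawn   <- c(intercept = -2, slope = -0.3)

ages <- c(2, 3, 4)
trajectories <- rbind(
  carrot = semantic_knowledge(ages, carrot),
  pawn   = semantic_knowledge(ages, pawn)
)
colnames(trajectories) <- paste0("age_", ages)
print(round(trajectories, 2))
```

Both objects share the overall increase with age, but they start at different levels and change at different rates, which is what gives each familiar object its own developmental trajectory.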

Semantic knowledge enters the likelihood term of the model specified above. The likelihood depends not just on the parameter settings for semantic knowledge but also on the value of the parameter capturing expectations about speaker informativeness (\(\alpha\) - see next section). As a consequence, these parameters are jointly estimated and co-vary with one another.

Expectations about speaker informativeness

A second locus of developmental change is expectations about speaker informativeness. In the context of the model, speaker informativeness corresponds to the degree to which the listener expects the speaker to choose the most informative of all available utterances. We assume that children at different ages might have different expectations about how rational or informative speakers are. As mentioned above, this parameter is jointly estimated with the parameters for semantic knowledge.

Sensitivity to common ground

Sensitivity to common ground refers to the probability that an object is taken to be the referent of the utterance before actually hearing the utterance. Thus, it captures the salience of an object due to its role in the social interaction that precedes the utterance. We expect children at different ages to respond differently to the common ground manipulation, resulting in an age specific prior distribution over objects.
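An age-specific prior distribution over objects could be written, for example, as a logistic function of age. The sketch below is only meant to show the shape of such a prior. The intercept, slope and anchoring age are hypothetical values and the function name is ours.

```r
# Prior probability of each object being the referent before hearing the utterance.
# ASSUMPTION: intercept and slope are hypothetical values on the logit scale.
common_ground_prior <- function(age, intercept = 0, slope = 0.9, min_age = 2) {
  p_new <- plogis(intercept + slope * (age - min_age))
  c(new_to_speaker = p_new, old_to_speaker = 1 - p_new)
}

# Priors for a 2-year-old and a 4-year-old under these hypothetical values.
print(round(common_ground_prior(2), 2))
print(round(common_ground_prior(4), 2))
```

With an intercept of zero, the youngest children start from a uniform prior, and the prior shifts toward the object that is new to the speaker as age increases.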

Model fitting

All Bayesian cognitive models were implemented in the probabilistic programming language WebPPL (Goodman and Stuhlmüller 2014). The corresponding model code can be found in the associated online repository and includes information about the prior distributions for all parameters (file xxxxx). To generate model predictions, we estimated age sensitive parameter distributions for semantic knowledge (by familiar object), speaker informativeness and common ground sensitivity and then passed them through the model in line with the different ways in which they can be combined and aligned. The resulting predictions come in the form of distributions of developmental trajectories for each object in the congruent and the incongruent condition.

Prediction

In this section we evaluate different models in terms of how well they predict information integration. That is, given that we know the development of the mutual exclusivity inference as well as the common ground inference, which model best predicts what happens when the two are combined (combination data from experiment 3)? Asking about “pure” prediction automatically excludes all models that include parameters which need to be fit to the combination data itself (e.g. the mixture model, see below). To generate these predictions, we estimated the model parameters for semantic knowledge and speaker informativeness based on experiment 1 and the parameter for common ground sensitivity based on experiment 2.

To estimate the parameters for semantic knowledge and speaker informativeness, we adapted the model described above to a situation in which both objects (novel and familiar) have equal prior probability. We then used the data from experiment 1 to infer the parameters. That is, we inferred which intercepts and slopes for speaker informativeness and semantic knowledge would generate model predictions that corresponded to the average proportion of correct responses measured in experiment 1. To estimate the parameters representing sensitivity to common ground, we used a simple logistic regression to infer which combination of intercept and slope would generate predictions that corresponded to the average proportion of correct responses measured in experiment 2.
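The logistic form of the common ground component can be sketched as follows. This is a minimal illustration, not the repository code: the function name, the parameter values and the choice to enter age in uncentered years are assumptions.

```python
import math


def common_ground_prior(age, intercept, slope):
    """Return the prior over [object favoured by common ground, other object].

    The probability of the favoured object is a logistic function of age.
    """
    p_favoured = 1.0 / (1.0 + math.exp(-(intercept + slope * age)))
    return [p_favoured, 1.0 - p_favoured]


# Hypothetical intercept and slope, for illustration only.
prior_age_2 = common_ground_prior(2.0, intercept=0.2, slope=0.3)
prior_age_4 = common_ground_prior(4.0, intercept=0.2, slope=0.3)
```

With a positive slope, the prior shifts further towards the object favoured by common ground as age increases; an intercept and slope of zero give a flat prior.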

To estimate the parameter distributions, we collected samples from six independent MCMC chains, collecting 150 000 samples from each chain and removing the first 50 000 for burn-in. We removed samples from one chain because it converged on a local maximum which resulted in parameter distributions that were substantially different from the other chains. The model outputs can be found in the following online repository: git large file storage.
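The post-processing of the chains amounts to the following bookkeeping, sketched here under the assumption that each chain is stored as a list of samples. The function name, the placeholder sample values and the index of the excluded chain are hypothetical.

```python
def pool_chains(chains, burn_in, excluded=()):
    """Drop the first `burn_in` samples of each chain, skip excluded chains, pool the rest."""
    pooled = []
    for index, samples in enumerate(chains):
        if index in excluded:
            continue  # skip chains flagged as not having converged
        pooled.extend(samples[burn_in:])
    return pooled


# Six placeholder chains with 150 000 samples each (the values are dummies).
chains = [[0.0] * 150_000 for _ in range(6)]

# Remove 50 000 burn-in samples per chain and exclude one chain (index is arbitrary).
pooled = pool_chains(chains, burn_in=50_000, excluded={3})
```

Under this reading, five chains with 100 000 post-burn-in samples each remain, so the pooled list holds 500 000 samples.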

Next, we combined the parameters according to the four models described below. Please note that the parameter distributions were the same for all models (see Figure 4) and that models only differed in which parameters they included. The models described below are a full model (integration model) and three lesion models, which selectively omit one type of information. The following model comparison therefore asks which types of information are necessary to make good predictions about how information is integrated. We do not compare models that make different assumptions about how information is integrated. We turn to this question in the explanation section.

Figure 4: Developmental trajectories for model parameters based on the posterior distribution for (A) semantic knowledge, (B) speaker informativeness and (C) prior sensitivity. Solid lines show the MAP estimate for each parameter. Lighter lines in (B) and (C) show 300 random draws from the posterior distribution to visualize uncertainty. (A) does not include these random draws for the sake of clarity.

Models

Integration model

The integration model serves as the full model and takes in all available information. That is, it takes in object specific semantic knowledge, speaker informativeness and common ground sensitivity and combines these components by way of the process described above. Figure 5 visualizes the corresponding model predictions in comparison to the data from experiment 3.

Figure 5: Predicting information integration across development. Model predictions based on the integration model. Colored lines show developmental trajectories for each familiar object and condition based on 300 random draws from the model posterior distribution. Top row (blue) shows the congruent condition and the bottom row (red) shows the incongruent condition. Familiar objects are ordered based on their rated age of acquisition (left to right). Dashed black lines show smoothed conditional mean of the data with 95% CI (in grey). Light dots are individual data points.

No word knowledge model

This model is the first lesion model. It takes in speaker informativeness and common ground sensitivity as well as general semantic knowledge, but omits semantic knowledge that is specific to the familiar objects. We described above that the parameters for semantic knowledge are fitted via a mixed effects model: there is an overall developmental trajectory for semantic knowledge (main effect) and object specific variation around this trajectory (random effects). The no word knowledge model takes in the overall trajectory for semantic knowledge but ignores object specific variation. That is, the model assumes a listener whose mutual exclusivity inference does not vary depending on the particular familiar object but only depends on the average semantic knowledge.

No common ground model

This model takes in object specific semantic knowledge and speaker informativeness but ignores common ground. Thus, the prior distribution over objects \(P(r)\) in the model described above is flat (e.g. [0.5, 0.5]). This corresponds to a listener who only focuses on the mutual exclusivity inference and ignores the common ground manipulation. As a consequence, the listener does not differentiate between the two alignment conditions.

No mutual exclusivity model

The last lesion model only takes in the common ground sensitivity. This corresponds to a listener who only focuses on common ground and ignores the identity of the objects on the tables as well as any inferences that their semantic knowledge of the familiar objects licenses. The model predictions therefore correspond to the prior distribution over objects \(P(r)\).

Model comparison

We compared the models mentioned above in two ways. First, we used correlations between model predictions and the data. For this analysis, we binned the model predictions and the data by year and familiar object. Figure 6 visualizes the correlation between model predictions and the data for all models. The results show a very high correlation between the predictions of the integration model and the data in all age groups, suggesting that the model accurately captures the variation in the data. The correlation increased from 2- to 3-year-olds but then dropped again for 4-year-olds. This is probably a consequence of the model making very extreme predictions for the oldest age group. Correlations for the integration model were also higher compared to the other models considered.
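The binning and correlation step could look roughly like the sketch below. The record layout, the object labels and all numbers are hypothetical, and the use of the Pearson coefficient is an assumption, since the text does not name the correlation measure.

```python
import math
from collections import defaultdict


def bin_means(records):
    """Average prediction and response within each (age in full years, object) bin."""
    bins = defaultdict(lambda: [0.0, 0.0, 0])
    for age, obj, predicted, observed in records:
        cell = bins[(int(age), obj)]
        cell[0] += predicted
        cell[1] += observed
        cell[2] += 1
    return {key: (p / n, o / n) for key, (p, o, n) in bins.items()}


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical records: (age in years, familiar object, model prediction, response).
records = [
    (2.3, "duck", 0.55, 1), (2.8, "duck", 0.60, 0),
    (3.1, "duck", 0.70, 1), (3.6, "duck", 0.75, 1),
    (2.5, "pawn", 0.50, 0), (2.9, "pawn", 0.52, 1),
    (3.4, "pawn", 0.58, 1), (3.7, "pawn", 0.62, 0),
]

binned = bin_means(records)
predicted_means = [p for p, _ in binned.values()]
observed_means = [o for _, o in binned.values()]
r = pearson(predicted_means, observed_means)
```

Condition could be added to the bin key in the same way if predictions are to be compared separately for the congruent and the incongruent condition.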

On the other hand, we compared models based on Bayes factors which were calculated from the marginal likelihoods of each model given the data. For this analysis, we treated age continuously. We first calculated the marginal log-likelihood for each model and then computed Bayes factors by subtracting the log-likelihoods for the models involved in the comparison and then exponentiating the result (see file model_comparison.Rmd in the associated online repository). Table 5 lists the Bayes factors for the different model comparisons. The results show that the integration model, by far, outperformed all the other models. When comparing the lesion models among each other, we see that models including the mutual exclusivity inference make better predictions compared to the no mutual exclusivity model.
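The computation described here reduces to a single line. In the sketch below, the two marginal log-likelihoods are made-up placeholders rather than values from the analysis, and the function names are illustrative; the actual computation is in model_comparison.Rmd.

```python
import math


def bayes_factor(log_ml_a, log_ml_b):
    """Bayes factor for model A over model B from marginal log-likelihoods."""
    return math.exp(log_ml_a - log_ml_b)


def log10_bayes_factor(log_ml_a, log_ml_b):
    """Same comparison on the log10 scale, which avoids overflow for large differences."""
    return (log_ml_a - log_ml_b) / math.log(10)


# Placeholder marginal log-likelihoods (natural log), for illustration only.
log_ml_full = -1000.0
log_ml_lesion = -1002.0

bf = bayes_factor(log_ml_full, log_ml_lesion)
```

A Bayes factor above 1 favours the first model in the comparison. For very large differences in marginal log-likelihood, it is safer to report the comparison on the log scale.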

Taken together, this analysis showed two things. First, the integration model makes accurate predictions about how mutual exclusivity and common ground inferences are integrated with one another. It does so based on knowing the strength and development of each inference alone. Second, models that omit one or more types of information (object specific word knowledge, speaker informativeness, common ground sensitivity) make much worse predictions. This suggests that children across the entire age range flexibly integrate all the available information. In the next section we ask whether there are other ways to think about the process of information integration than the one formalized in the integration model.

Figure 6: Predicting information integration. Correlations between model predictions and data binned by year, item and condition. Vertical and horizontal error bars show 95% HDI. Blue diamonds show congruent condition and red ones show the incongruent condition.

Table 5: Model comparison using Bayes factors computed based on the marginal likelihood of each model given the data.
Model comparison                                Bayes factor
integration vs no word knowledge                1.2e+14
integration vs no common ground                 7.0e+14
integration vs no mutual exclusivity            7.9e+40
no word knowledge vs no mutual exclusivity      5.6e+00
no word knowledge vs no common ground           6.3e+26
no mutual exclusivity vs no common ground       1.1e+26

Explanation

In this section we ask how information integration across development is best explained, and we explore alternative ways to think about the integration process. The integration model outlined above operates via Bayesian inference: the prior probability of a referent, a consequence of the common ground manipulation, is updated upon hearing the utterance and computing the mutual exclusivity inference. In essence, this is a multiplicative model because the probability of each referent is proportional to the product of the likelihood and the prior. The alternative we consider here is an additive model. This mixture model assumes that the two inferences are computed separately and then weighted by a mixture parameter \(\phi\). Psychologically, \(\phi\) may be interpreted as a bias for one type of inference relative to the other.
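The contrast between the two integration schemes can be made concrete with a small sketch. It assumes two candidate referents and treats each inference as a normalized distribution over them; all numbers and function names are hypothetical, and the code is not the WebPPL implementation.

```python
def normalize(values):
    """Rescale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]


def integrate_multiplicative(likelihood, prior):
    """Bayesian integration: posterior is proportional to likelihood times prior."""
    return normalize([l * p for l, p in zip(likelihood, prior)])


def integrate_additive(me_inference, cg_inference, phi):
    """Mixture: weight the two separately computed inferences by phi and 1 - phi."""
    return [phi * m + (1 - phi) * c for m, c in zip(me_inference, cg_inference)]


# Hypothetical distributions over [novel object, familiar object].
me = [0.8, 0.2]              # mutual exclusivity favours the novel object
cg_congruent = [0.7, 0.3]    # common ground also favours the novel object
cg_incongruent = [0.3, 0.7]  # common ground favours the familiar object

mult_congruent = integrate_multiplicative(me, cg_congruent)
mult_incongruent = integrate_multiplicative(me, cg_incongruent)
add_congruent = integrate_additive(me, cg_congruent, phi=0.5)
add_incongruent = integrate_additive(me, cg_incongruent, phi=0.5)
```

With these numbers, multiplicative integration of two congruent inferences yields a stronger preference for the novel object than either inference alone, whereas the additive mixture always lies between the two.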

With respect to developmental change, the integration model assumes that the process by which information is integrated remains the same across development. What changes is children’s semantic knowledge, their expectations about speaker informativeness and their sensitivity to common ground. But the way these information sources are combined remains the same across age. As an alternative, we explore a developmental mixture model. This model makes the same assumptions about developmental change in the individual information sources but, in addition, assumes that the way in which the mutual exclusivity and the common ground inference are combined with one another changes over time. It is structurally identical to the mixture model but the mixture parameter \(\phi\) is a function of age.

In this section, we make use of all the available data. We use the data from experiments 1 and 2 to compute prior distributions over model parameters, which are subsequently updated depending on how well the model predictions they generate fit the data from experiment 3.

Integration model

The integration model in this section differs from the same model in the prediction section only in the distributions for the parameters representing the different information sources. Figures 10 to 12 in the appendix show how the parameter distributions differ between the prediction and the explanation version of the integration model. In brief, semantic knowledge and speaker informativeness have similar distributions when all the data are taken into account compared to when these parameters are estimated based only on experiments 1 and 2. In contrast, the intercept for common ground sensitivity is estimated to be larger and the slope shallower. The code to run the model can be found in the associated online repository (file: xxxxxxx). Figure 7 shows model predictions for the integration model in comparison to the data from experiment 3.

Figure 7: Explaining information integration across development. Model predictions based on the integration model. Colored lines show developmental trajectories for each familiar object and condition based on 300 random draws from the model posterior distribution. Top row (blue) shows the congruent condition and the bottom row (red) shows the incongruent condition. Familiar objects are ordered based on their rated age of acquisition (left to right). Dashed black lines show smoothed conditional mean of the data with 95% CI (in grey). Light dots are individual data points.

Mixture model

In the mixture model, the two inferences are computed in the same way as in the integration model. Subsequently, they are weighted by the mixture parameter \(\phi\):

\[\begin{equation} P_{L_1}(r, \mathcal{L}|u) = \phi P_{S_1}(u|r_{t}, \mathcal{L}) + (1-\phi) P(\mathcal{L})P(r) \end{equation}\]

To estimate \(\phi\) as well as the other parameters, we make use of all the available data. The model code can be found in the associated online repository (file: xxxxxxx). The posterior distribution for the mixture parameter \(\phi\) is shown in Figure 8A. It suggests that the mutual exclusivity inference is weighted slightly more heavily than the common ground inference.

Developmental mixture model

For this model, we make the mixture parameter \(\phi\) a function of age and estimate the intercept and slope that yield the best model predictions compared to the data from experiment 3. Figure 8B visualizes the developmental trajectory of the mixture parameter. Based on this model, the common ground inference seems to decrease in importance with age relative to the mutual exclusivity inference.

Figure 8: Mixture component for the mixture model (A) and the developmental mixture model (B). (A) shows the posterior distribution of the mixture component and (B) shows developmental trajectories for the mixture component based on 300 random draws from the posterior distribution for intercept and slope.

Model comparison

We compared models based on correlations and Bayes factors. Figure 9 shows correlations between model predictions and the data, each binned by year and object. Even though model predictions and data are closely aligned for all models, the integration model shows the highest correlation in all age groups. Next, we directly compared models using Bayes factors (see Table 6). We also included the prediction integration model in this analysis.

Perhaps unsurprisingly, we see that updating the parameters for semantic knowledge, speaker informativeness and common ground sensitivity based on the data from experiment 3 greatly improves the model fit (comparison: integration (explanation) vs integration (prediction)). We also see that the explanation integration model provides, by far, the best fit to the data compared to the two mixture models. Interestingly, the prediction integration model also had a better fit than both mixture models, even though its parameters were not updated based on the data from experiment 3. When comparing the two mixture models directly, we see that an age-sensitive mixture parameter did not result in a substantially better fit.

This analysis shows that the inference and integration process described by the integration model accurately captures the data and also explains the integration process better than the additive mixture models. As a consequence, we may say that instead of being biased towards one type of inference, children are rationally integrating all the information sources available.

Figure 9: Explaining information integration. Correlations between model predictions and data binned by year, item and condition. Vertical and horizontal error bars show 95% HDI. Blue diamonds show congruent condition and red ones show the incongruent condition.

Table 6: Model comparison using Bayes factors computed based on the marginal likelihood of each model given the data.
Model comparison                                          Bayes factor
integration (explanation) vs integration (prediction)     128748
integration (explanation) vs mixture                      26028662
integration (explanation) vs developmental mixture        6280259
integration (prediction) vs mixture                       202
integration (prediction) vs developmental mixture         49
developmental mixture vs mixture                          4

Summary

Here we studied how 2- to 5-year-old children integrate semantic and pragmatic information during word learning. In three experiments, we first showed that children make a mutual exclusivity inference and that this inference varied depending on children’s familiarity with the objects involved (experiment 1). Next, we showed that children make common ground inferences based on their interactions with a speaker (experiment 2). When the two inferences were combined, we found that children were sensitive to the way in which they were aligned (experiment 3).

We then introduced a computational model to investigate the process by which children integrated the inferences in experiment 3. As a start, we described mutual exclusivity as a pragmatic inference, which takes in children’s emerging semantic knowledge and their expectations about how informative a speaker is. The integration model assumes that this inference is then flexibly integrated with children’s developing sensitivity to common ground.

Next, we tested the predictive power of this model. That is, we asked how well the model would predict the data of experiment 3, when only knowing the developmental trajectories for mutual exclusivity (based on experiment 1) and common ground (experiment 2). We found a very close alignment of the model predictions and the data across the entire age range. Furthermore, the integration model provided a better fit to the data compared to a number of lesion models, which selectively omitted one type of information. This suggests that children flexibly integrate all available information.

In the final section, we studied which process best explained children’s information integration. We compared the integration model to mixture models which assumed that children are biased towards one type of inference. We found that the integration model explained the data better than these alternative models. In sum, we found that children’s integration of semantic and pragmatic information during word learning is best described as a form of rational social inference.

Appendix: Model parameters

In the following, we visualize the model parameters for semantic knowledge, speaker informativeness and common ground sensitivity. Please note that the alternative lesion models presented in the prediction section used the same parameter distributions as the prediction integration model.

Semantic knowledge

Intercept

Figure 10: Posterior distribution of intercept term for semantic knowledge for each object by model.

Slope

Figure 11: Posterior distribution of slope term for semantic knowledge for each object by model.

Speaker informativeness and common ground sensitivity

Figure 12: Posterior distribution of slope and intercept terms for speaker informativeness and sensitivity to common ground by model.

Akhtar, Nameera, Malinda Carpenter, and Michael Tomasello. 1996. “The Role of Discourse Novelty in Early Word Learning.” Child Development 67 (2). Wiley Online Library: 635–45.

Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. doi:10.18637/jss.v080.i01.

Clark, Eve V. 1987. “The Principle of Contrast: A Constraint on Language Acquisition.” Lawrence Erlbaum Associates, Inc.

Diesendruck, Gil, Lori Markson, Nameera Akhtar, and Ayelet Reudor. 2004. “Two-Year-Olds’ Sensitivity to Speakers’ Intent: An Alternative Account of Samuelson and Smith.” Developmental Science 7 (1). Wiley Online Library: 33–41.

Frank, Michael C, and Noah D Goodman. 2012. “Predicting Pragmatic Reasoning in Language Games.” Science 336 (6084). American Association for the Advancement of Science: 998.

Frank, Michael C, Elise Sugarman, Alexandra C Horowitz, Molly L Lewis, and Daniel Yurovsky. 2016. “Using Tablets to Collect Data from Young Children.” Journal of Cognition and Development 17 (1). Taylor & Francis: 1–17.

Goodman, Noah D, and Michael C Frank. 2016. “Pragmatic Language Interpretation as Probabilistic Inference.” Trends in Cognitive Sciences 20 (11). Elsevier: 818–29.

Goodman, Noah D, and Andreas Stuhlmüller. 2014. “The Design and Implementation of Probabilistic Programming Languages.” http://dippl.org.

Kuperman, Victor, Hans Stadthagen-Gonzalez, and Marc Brysbaert. 2012. “Age-of-Acquisition Ratings for 30,000 English Words.” Behavior Research Methods 44 (4). Springer: 978–90.

Lewis, Molly L, Veronica Cristiano, Brenden M. Lake, Tammy Kwan, and Michael C Frank. 2020. “The Role of Developmental Change and Linguistic Experience in the Mutual Exclusivity Effect.” Cognition 198: 104191.

Markman, Ellen M, and Gwyn F Wachtel. 1988. “Children’s Use of Mutual Exclusivity to Constrain the Meanings of Words.” Cognitive Psychology 20 (2). Elsevier: 121–57.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Texts in Statistical Science. Boca Raton: CRC Press.

Morey, Richard D., and Jeffrey N. Rouder. 2018. BayesFactor: Computation of Bayes Factors for Common Designs. https://CRAN.R-project.org/package=BayesFactor.